Content-oriented XML retrieval with HyREX

Authors

  • Norbert Gövert
  • Mohammad Abolhassani
  • Norbert Fuhr
  • Kai Großjohann
Abstract

The eXtensible Markup Language (XML) is the emerging standard for representing knowledge in almost arbitrary applications; almost every kind of knowledge can be represented in XML. The major purpose of XML markup is the explicit representation of the logical structure of a document. From an information retrieval (IR) point of view, users should benefit from the structural information inherent in XML documents. The XML Information Retrieval Query Language (XIRQL) [Fuhr & Großjohann 01, Fuhr & Großjohann 02] has been developed to serve this purpose. XIRQL extends the XPath [Clark & DeRose 99] part of the (proposed standard) query language XQuery [Chamberlin et al. 01] with features that are important in IR-style applications. For instance, IR research has shown that document term weighting and query term weighting are crucial concepts for effective information retrieval. XIRQL allows for term weighting with regard to the components of the documents' logical structure. This is used to implement the retrieval paradigm suggested by the FERMI multimedia model for IR [Chiaramella et al. 96]: instead of treating documents as atomic units, we aim at retrieving those document components (elements) which answer a given information need in the most specific way. This strategy is used to process the content-only (CO) topics provided within the INitiative for the Evaluation of XML retrieval (INEX), where no structural conditions are used within the queries. Given the logical structure inherent to XML documents, users also want to pose queries not only on the content but also on the structure of the documents. The INEX content-and-structure (CAS) topics reflect that. As an extension of XPath, the XIRQL query language is capable of processing these queries. The Hyper-media Retrieval Engine for XML (HyREX) [Abolhassani et al. 02] provides an implementation of the XIRQL query language.
In the following we describe this implementation with regard to processing the INEX CO and CAS topics. In Section 2 we show how the ranking of the most specific document components is done in HyREX, which serves to process the content-only topics. Section 3 details the algorithms used to produce such a ranking of document components, while Section 4 presents the evaluation results of our approach. Section 5 shows how XIRQL concepts are used to process the CAS topics. In addition, we give a brief overview of the concepts of data types and vague predicates, which can lead to high-precision searches in combination with structural retrieval. A conclusion and an outlook on further research are given in Section 6.
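
The idea of ranking document components rather than whole documents can be illustrated with a small sketch. This is not the HyREX or XIRQL algorithm, which the abstract does not specify. It assumes a simple length-normalised weighted term frequency as the element score, so that short, focused elements tend to rank above the larger elements that contain them. All names (`tokenize`, `score`, `rank_elements`, `DOC`, `QUERY`) and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(elem, query_weights):
    """Assumed scoring: weighted query-term frequency divided by the
    number of tokens in the element (including its descendants)."""
    tokens = tokenize(" ".join(elem.itertext()))
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(w * counts[t] for t, w in query_weights.items()) / len(tokens)

def rank_elements(xml_string, query_weights):
    """Score every element of the document and return (score, tag)
    pairs with a non-zero score, best first."""
    root = ET.fromstring(xml_string)
    results = []
    for elem in root.iter():
        s = score(elem, query_weights)
        if s > 0:
            results.append((s, elem.tag))
    results.sort(key=lambda r: r[0], reverse=True)
    return results

# Hypothetical sample document and weighted content-only query.
DOC = (
    "<article>"
    "<title>XML retrieval</title>"
    "<sec>"
    "<p>Term weighting matters for retrieval.</p>"
    "<p>Unrelated filler text about other things entirely.</p>"
    "</sec>"
    "</article>"
)
QUERY = {"retrieval": 1.0, "weighting": 0.5}

if __name__ == "__main__":
    for s, tag in rank_elements(DOC, QUERY):
        print(f"{tag}: {s:.3f}")
```

In this sketch the matching paragraph scores higher than the section that encloses it, because the section's score is diluted by the unrelated paragraph. A real system would also need to handle overlapping results (an element and its ancestors both appearing in the ranking), which the sketch leaves untouched.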

Similar resources

Bilingual Information Retrieval with HyREX and Internet Translation Services

HyREX is the Hypermedia Retrieval Engine for XML. Its extensibility is based on the implementation of physical data independence; its query interface on the conceptual level consists of data types with respective vague search predicates. This concept enabled us to add search predicates for the data type text for doing bilingual text retrieval. Our implementation uses free Internet resources fo...

What Do Users Think of an XML Element Retrieval System?

We describe the University of Amsterdam’s participation in the INEX 2005 Interactive Track, mainly focusing on a comparative experiment, in which the baseline system Daffodil/HyREX is compared to a home-grown XML element retrieval system (xmlfind). The xmlfind system provides an interface for an XML information retrieval search engine, using an index that contains all the individual XML element...

Modelling Vague Content and Structure Querying in XML Retrieval with a Probabilistic Object-Relational Framework

Many XML retrieval applications require relevance-oriented ranking of retrieved elements in order to capture the vagueness inherent to the information retrieval process. This relevance-oriented ranking should not only support vagueness at the content level, but also at the structural level. In this paper, we use a probabilistic object-relational framework to model representation and retrieval s...

XML Retrieval

DEFINITION Text documents often contain a mixture of structured and unstructured content. One way to format this mixed content is according to the adopted W3C standard for information repositories and exchanges, the eXtensible Mark-up Language (XML). In contrast to HTML, which is mainly layout-oriented, XML follows the fundamental concept of separating the logical structure of a document from i...

Content oriented retrieval on document centric XML

XML is the perfect format for storing (mostly) textual documents in a digital library; its flexibility enables users to store both highly structured data (like database records) and free text in the same document. The data-centric parts can be searched using query languages like XPath and XQuery, where exact conditions on the structure can be imposed. For digital libraries, however, it is impor...

Publication date: 2002